Audiovisual source separation

Author

  • Bertrand Rivet
Abstract

Blind source separation (BSS) can be seen as a generalization of denoising a noisy signal to the case where several sensors are available. Each sensor records the same physical phenomenon in a different way, and this diversity can be exploited to separate the signals present, for instance by independent component analysis (ICA) or sparse component analysis (SCA) [1]. The main objective of speech separation/extraction is to have a machine mimic the human ability to separate multiple sound sources from their mixtures, i.e. a computer-based solution to the so-called cocktail party problem. The term was coined by Colin Cherry in 1953 [2], who first asked the question: “How do we [humans] recognize what one person is saying when others are speaking at the same time?” Despite extensive study, it remains a scientific challenge and an active research area. Over the past decade, a major line of work in the signal processing community has addressed the problem in the framework of convolutive blind source separation (CBSS), where the sound recordings are modeled as linear convolutive mixtures of the unknown speech sources [1]. In recent decades, many unimodal algorithms have been developed, i.e. algorithms operating only in the audio domain. However, as is widely accepted, both speech production and perception are inherently audio-visual processes that involve information from multiple modalities, e.g. [3]. On the one hand, the production of speech is usually coupled with the visible movement of the mouth and facial muscles. On the other hand, looking at the lip movements of a speaker (i.e. lip reading) helps listeners understand what has been said in a conversation, in particular when multiple competing conversations and background noise are present simultaneously, as shown in early work [4]. Work in this direction, i.e. integrating visual information into an audio-only speech source separation system, is emerging as an exciting new area in signal processing: multimodal (audio-visual) speech separation [5].
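
To make the convolutive mixing model mentioned in the abstract concrete, the following is a minimal sketch in Python with NumPy. It only illustrates the forward model, in which each sensor signal is a sum of filtered sources, x_i[t] = sum_j (h_ij * s_j)[t]; it is not the separation method of the paper. The function name `convolutive_mix`, the random FIR filters, and the Laplacian toy sources are all assumptions made for illustration.

```python
import numpy as np


def convolutive_mix(sources, filters):
    """Forward convolutive mixing model (illustrative, hypothetical helper).

    sources: array of shape (n_sources, n_samples)
    filters: array of shape (n_sensors, n_sources, filter_len), one FIR
             filter h_ij per (sensor i, source j) pair
    returns: array of shape (n_sensors, n_samples)
    """
    n_sensors, n_sources, _ = filters.shape
    n_samples = sources.shape[1]
    mixtures = np.zeros((n_sensors, n_samples))
    for i in range(n_sensors):
        for j in range(n_sources):
            # Full linear convolution, truncated to the source length.
            mixtures[i] += np.convolve(sources[j], filters[i, j])[:n_samples]
    return mixtures


# Toy data (assumed): two sparse-ish sources and two sensors.
rng = np.random.default_rng(0)
sources = rng.laplace(size=(2, 1000))
filters = 0.5 * rng.normal(size=(2, 2, 8))
mixtures = convolutive_mix(sources, filters)
```

When every filter has length one, the model reduces to the instantaneous mixture x = A s that standard ICA assumes; longer filters model room reverberation and propagation delays, which is why the convolutive case is harder to invert.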

Similar resources

Blind Audiovisual Source Separation Using Sparse Redundant Representations

In this work, we present a method that jointly separates active audio and visual structures on a given mixture. This new concept, the Blind Audiovisual Source Separation (BAVSS), is achieved by exploiting the coherence existing between the recorded signal of a video camera and only one microphone. An efficient representation of audio and video sequences allows to build robust audiovisual relati...

Audiovisual Programs As Sources Of Language Input: An Overview

Audiovisual devices such as satellite and conventional televisions can offer easy access to authentic programs which are considered to be a rich source of language input for SLA (Second Language Acquisition). The immediacy of various audiovisual programs ensures that language learners’ exposure is up-to-date and embedded in the real world of native speakers. In the same line, in the present pap...

Audiovisual speech source separation: a regularization method based on visual voice activity detection

Audio-visual speech source separation consists in mixing visual speech processing techniques (e.g. lip parameters tracking) with source separation methods to improve and/or simplify the extraction of a speech signal from a mixture of acoustic signals. In this paper, we present a new approach to this problem: visual information is used here as a voice activity detector (VAD). Results show that, ...

Further experiments on audio-visual speech source separation

Looking at the speaker’s face seems useful to better hear a speech signal and extract it from competing sources before identification. This might result in elaborating new speech enhancement or extraction techniques exploiting the audio-visual coherence of speech stimuli. In this paper, we present a set of experiments on a novel algorithm plugging audio-visual coherence estimated by statistical...

Developing an audio-visual speech source separation algorithm

Looking at the speaker's face is useful to hear better a speech signal and extract it from competing sources before identification. This might result in elaborating new speech enhancement or extraction techniques exploiting the audiovisual coherence of speech stimuli. In this paper, a novel algorithm plugging audio-visual coherence estimated by statistical tools on classical blind source separa...

Publication date: 2013